Abstract

Finding the product of two polynomials is an essential and basic problem in
computer algebra. While most previous results have focused on the worst-case
complexity, we instead employ the technique of adaptive analysis to give an
improvement in many "easy" cases. We present two adaptive measures and methods
for polynomial multiplication, and also show how to effectively combine them
to gain both advantages. One useful feature of these algorithms is that they
essentially provide a gradient between existing "sparse" and "dense" methods.
We prove that these approaches provide significant improvements in many cases
but in the worst case are still comparable to the fastest existing algorithms.

1 Introduction

Computing the product of two polynomials is one of the most important problems
in symbolic computation, and the operation is part of the basic functionality
of any computer algebra system. We introduce new multiplication algorithms
which use the technique of adaptive analysis to gain improvements compared to
existing approaches both in theory and in practice.



1.1 Background 

For what follows, R is an arbitrary ring (commutative, with identity), such that 
ring elements have unit storage and basic ring operations have unit cost. In 
complexity estimates, we also count operations on word-sized integers, which
are assumed only to be large enough (in absolute value) to store the size of the
input.

There are essentially two representations for univariate polynomials over $R$,
and existing algorithms for multiplication require one of these representations.
Let $f \in R[x]$ with degree less than $n$ be written as

    $f = c_0 + c_1 x + c_2 x^2 + \cdots + c_{n-1} x^{n-1}$,    (1.1)

for $c_0, \ldots, c_{n-1} \in R$. The dense representation of $f$ is simply an
array $[c_0, c_1, \ldots, c_{n-1}]$ of length $n$.

Next, suppose that at most $t$ of the coefficients are nonzero, so that we can
write

    $f = a_1 x^{e_1} + a_2 x^{e_2} + \cdots + a_t x^{e_t}$,    (1.2)

for $a_1, \ldots, a_t \in R$ and $0 \le e_1 < \cdots < e_t$. Hence
$a_i = c_{e_i}$ for $1 \le i \le t$, and in particular $e_t = \deg f$. The
sparse representation of $f$ is a list of coefficient-exponent tuples
$(a_1, e_1), \ldots, (a_t, e_t)$. The exponents in this case could be
multi-precision integers, and so the total size of the sparse representation is
proportional to $\sum_{i=1}^{t} (1 + \log_2 e_i)$. This is bounded below by
$\Omega(t \log t + \log n)$ and above by $O(t \log n)$.
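As a small illustration of the two representations, the following Python sketch converts between a dense coefficient array and a sparse list of (coefficient, exponent) tuples. The function names are hypothetical, and ordinary Python integers stand in for elements of the ring.

```python
def dense_to_sparse(coeffs):
    """Dense array [c_0, ..., c_{n-1}] -> list of (a_i, e_i) with a_i != 0,
    sorted by increasing exponent."""
    return [(c, e) for e, c in enumerate(coeffs) if c != 0]


def sparse_to_dense(terms):
    """List of (a_i, e_i) sorted by exponent -> dense array of length deg f + 1."""
    if not terms:
        return []
    dense = [0] * (terms[-1][1] + 1)   # the last exponent is deg f
    for a, e in terms:
        dense[e] = a
    return dense
```

The dense array has length proportional to the degree, while the sparse list has one entry per nonzero term, which is the trade-off the two families of multiplication algorithms exploit.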

Algorithmic advances in dense polynomial multiplication have generally
followed results for long integer multiplication. The $O(n^2)$ school method
was first improved by Karatsuba and Ofman [1963] to $O(n^{1.59})$ with a two-way
divide-and-conquer scheme, later generalized to $k$-way by Toom [1963] and
Cook [1966]. Schönhage and Strassen [1971] developed the first pseudo-linear
time algorithm for integer multiplication with cost $O(n \log n \log\log n)$;
this is also the cost of the fastest known algorithm for polynomial
multiplication [Cantor and Kaltofen, 1991].

In practice, all of these algorithms will be used in certain ranges, and so we
employ the usual notation of a multiplication time function $M(n)$, the cost of
multiplying two dense polynomials with degrees both less than $n$. Also define
$\delta(n) = M(n)/n$. If $f, g \in R[x]$ with different degrees $\deg f < n$,
$\deg g < m$, and $n > m$, by splitting $f$ into $\lceil n/m \rceil$ size-$m$
blocks we can compute the product $f \cdot g$ with cost $O(\frac{n}{m} M(m))$,
or $O(n \cdot \delta(m))$.
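The block-splitting argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: polynomials are dense coefficient lists over the integers, and the hypothetical helper `dense_mul` (a school-method product) stands in for whichever dense routine realizes $M(m)$.

```python
def dense_mul(f, g):
    """School-method product of two dense coefficient lists
    (a stand-in for any dense routine with cost M(n))."""
    if not f or not g:
        return []
    h = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] += a * b
    return h


def unbalanced_mul(f, g):
    """Multiply f (n coefficients) by g (m <= n coefficients) using
    ceil(n/m) products of size-m blocks."""
    if len(f) < len(g):
        f, g = g, f
    if not g:
        return []
    m = len(g)
    h = [0] * (len(f) + m - 1)
    for start in range(0, len(f), m):
        block = f[start:start + m]              # one size-m block of f
        for k, c in enumerate(dense_mul(block, g)):
            h[start + k] += c                   # shift by x^start and accumulate
    return h
```

Each of the roughly $n/m$ block products costs $M(m)$, which gives the $O(n \cdot \delta(m))$ total quoted above; the additions that combine overlapping blocks are linear in the output size.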

For the multiplication of two sparse polynomials as in (1.2), the school
method uses $O(t^2)$ ring operations, which cannot be improved in the worst
case. However, since the degrees could be very large, the cost of exponent
arithmetic becomes significant. The school method uses $O(t^3 \log n)$ word
operations and $O(t^2)$ space. Yan [1998] reduces the number of word operations
to $O(t^2 \log t \log n)$ with the "geobuckets" data structure. Finally, recent
work by Monagan and Pearce [2007], following Johnson [1974], gets this same
time complexity but reduces the space requirement to $O(t + r)$, where $r$ is
the number of nonzero terms in the product.

The algorithms we present are for univariate polynomials. They can also be
used for multivariate polynomial multiplication by using Kronecker substitution:
Given two $n$-variate polynomials $f, g \in R[x_1, \ldots, x_n]$ with max
degrees less than $d$, substitute $x_i = y^{(2d)^{i-1}}$ for $1 \le i \le n$,
multiply the univariate polynomials over $R[y]$, then convert back. Many other
representations exist for multivariate polynomials [see Fateman, 2002], but we
will not compare with them or consider them further.
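A minimal Python sketch of Kronecker substitution follows, assuming the substitution $x_i = y^{(2d)^{i-1}}$ with base $2d$ (large enough that exponents of the product, each below $2d$, do not collide). The dictionary-based polynomial format and all function names are hypothetical choices made for this illustration.

```python
def kronecker_pack(poly, d):
    """Map {(e_1, ..., e_n): coeff} to {E: coeff}, E = sum e_i * (2d)^(i-1)."""
    base = 2 * d
    out = {}
    for exps, c in poly.items():
        E = sum(e * base ** i for i, e in enumerate(exps))
        out[E] = out.get(E, 0) + c
    return out


def kronecker_unpack(upoly, d, nvars):
    """Inverse of kronecker_pack: read off the base-(2d) digits of each exponent."""
    base = 2 * d
    out = {}
    for E, c in upoly.items():
        exps = []
        for _ in range(nvars):
            E, r = divmod(E, base)
            exps.append(r)
        out[tuple(exps)] = c
    return out


def sparse_mul(f, g):
    """School-method product of univariate polynomials stored as {exponent: coeff}."""
    h = {}
    for ef, cf in f.items():
        for eg, cg in g.items():
            h[ef + eg] = h.get(ef + eg, 0) + cf * cg
    return {e: c for e, c in h.items() if c != 0}


def multivariate_mul(f, g, d, nvars):
    """Multiply two multivariate polynomials whose max degrees are less than d."""
    return kronecker_unpack(sparse_mul(kronecker_pack(f, d), kronecker_pack(g, d)), d, nvars)
```

For example, with $d = 2$ the product $(x_1 + x_2)(x_1 - x_2)$ is computed as a univariate product in $y$ with $x_1 = y$ and $x_2 = y^4$, and unpacking recovers $x_1^2 - x_2^2$.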

1.2 Overview of Approach 

The performance of an adaptive algorithm depends not only on the size of the 
input but also on some inherent difficulty measure. Such algorithms match 
standard approaches in their worst-case performance, but perform far better 
on many instances. This idea was first applied to sorting algorithms and has 
proved useful both in theory and in practice [see Petersson and Moffat, 1995]. 
Such techniques have also proven useful in symbolic computation, for example 
the early termination strategy of Kaltofen and Lee [2003].

Hybrid algorithms combine multiple different approaches to the same prob- 
lem to effectively handle more cases [e.g. Duran et al., 2003]. Our algorithms 
are also hybrid in the sense that they provide a smooth gradient between ex- 
isting sparse and dense multiplication algorithms. The adaptive nature of the 
algorithms means that in fact they will be faster than existing algorithms in 
many cases, while never being (asymptotically) slower. 

The algorithms we present will always proceed in three stages. First, the 
polynomials are read in and converted to a different representation which effec- 
tively captures the relevant measure of difficulty. Second, we multiply the two 
polynomials in the alternate representation. Finally, the product is converted 
back to the original representation. 

The computational cost of the second step (where the multiplication is actu- 
ally performed) depends on the difficulty of the particular instance. Therefore 
this step should be the dominating cost of the entire algorithm, and in par- 
ticular the cost of the conversion steps must be linear in the size of the input 
polynomials. 

In Section 2, we give the first idea for adaptive multiplication, which is to 
write a polynomial as a list of dense "chunks". The second idea, presented in
Section 3, is to write a polynomial with "equal spacing" between coefficients 
as a dense polynomial composed with a power of the indeterminate. Section 4 
shows how to combine these two ideas to make one algorithm which effectively 
captures both difficulty measures. Finally, a few conclusions and ideas for future 
directions are discussed in Section 5. 

Preliminary progress on some of these results was presented at the Milestones 
in Computer Algebra (MICA) conference held in Tobago in May 2008 [Roche, 
2008]. 

2 Chunky Polynomials 

The basic idea of chunky multiplication is a straightforward combination of the 
standard sparse and dense representations, providing a natural gradient between 
the two approaches for multiplication. We note that a similar idea was noticed
(independently) around the same time by Fateman [2008, page 11], although
the treatment here is much more extensive.

For $f \in R[x]$ of degree $n$, the chunky representation of $f$ is a sparse
polynomial with dense polynomial "chunks" as coefficients:

    $f = f_1 x^{e_1} + f_2 x^{e_2} + \cdots + f_t x^{e_t}$,    (2.1)

with $f_i \in R[x]$ and $e_i \in \mathbb{N}$ for each $1 \le i \le t$. We
require only that $e_{i+1} > e_i + \deg f_i$ for $1 \le i \le t - 1$, and each
$f_i$ has nonzero constant coefficient.

Recall the notation introduced above of $\delta(n) = M(n)/n$. A unique feature
of our approach is that we will actually use this function to tune the algorithm.
That is, we assume a subroutine is given to evaluate $\delta(n)$ for any chosen
value $n$.

If $n$ is a word-sized integer, then the computation of $\delta(n)$ must use a
constant number of word operations. If $n$ is more than word-sized, then we are
asking about the cost of multiplying two dense polynomials that cannot fit in
memory, so the subroutine should return $\infty$ in such cases. Practically
speaking, the $\delta(n)$ evaluation will usually be an approximation of the
actual value, but for what follows we assume the computed value is always
exactly correct.

Furthermore, we require $\delta(n)$ to be an increasing function which grows
more slowly than linearly, meaning that for any $a, b, d \in \mathbb{N}$ with
$a < b$,

    $\delta(a + d) - \delta(a) \ge \delta(b + d) - \delta(b)$.    (2.2)

These conditions are clearly satisfied for all the dense multiplication algorithms
and corresponding $M(n)$ functions discussed above, including the piecewise
function used in practice.
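The growth condition can be checked numerically for a candidate cost function. The sketch below tests the inequality $\delta(a+d) - \delta(a) \ge \delta(b+d) - \delta(b)$ for $a < b$ over a small grid. The two cost models are illustrative assumptions, not prescribed by the text: $\delta(n) = n$ for the school method and $\delta(n) = n^{\log_2 3 - 1}$ for Karatsuba's method, both with constant factors dropped.

```python
import math


def delta_school(n):
    """delta(n) = M(n)/n for the O(n^2) school method (constant dropped)."""
    return n


def delta_karatsuba(n):
    """delta(n) = M(n)/n for Karatsuba's method, with M(n) ~ n^(log2 3)."""
    return n ** (math.log2(3) - 1)


def satisfies_growth_condition(delta, limit=40, tol=1e-12):
    """Check delta(a+d) - delta(a) >= delta(b+d) - delta(b)
    for all 1 <= a < b <= limit and 1 <= d <= limit."""
    for a in range(1, limit + 1):
        for b in range(a + 1, limit + 1):
            for d in range(1, limit + 1):
                if delta(a + d) - delta(a) + tol < delta(b + d) - delta(b):
                    return False
    return True
```

A function growing faster than linearly, such as $n^2$, fails the check, which is why the condition excludes it as a model of $\delta$.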

The conversion of a sparse or dense polynomial to the chunky representation 
proceeds in two stages: first, we compute an "optimal chunk size" $k$, and then we
use this computed value as a parameter in the actual conversion algorithm. The 
product of the two polynomials is then computed in the chunky representation, 
and finally the result is converted back to the original representation. The steps 
are presented in reverse order in the hope that the goals at each stage are more 
clear. 

2.1 Multiplication in the chunky representation 

Multiplying polynomials in the chunky representation uses sparse multiplication 
on the outer loop, treating each dense polynomial chunk as a coefficient, and 
dense multiplication to find each product of two chunks. 

For $f, g \in R[x]$ to be multiplied, write $f$ as in (2.1) and $g$ as

    $g = g_1 x^{d_1} + g_2 x^{d_2} + \cdots + g_s x^{d_s}$,    (2.3)

with $s \in \mathbb{N}$ and similar conditions on each $g_i \in R[x]$ and
$d_i \in \mathbb{N}$ as in (2.1). Without loss of generality, assume also that
$t \ge s$, that is, $f$ has at least as many chunks as $g$. To multiply $f$ and
$g$, we need to compute each product $f_i g_j$ for $1 \le i \le t$ and
$1 \le j \le s$ and put the resulting chunks into sorted order. It is likely
that some of the chunk products will overlap, and hence some coefficients will
also need to be summed.

By using heaps of pointers as in Monagan and Pearce [2007], the chunks of
the result are computed in order, eliminating unnecessary additions and using
little extra space. A min-heap of size $s$ is filled with pairs $(i, j)$, for
$i, j \in \mathbb{N}$, and ordered by the corresponding sum of exponents
$e_i + d_j$. Each time we compute a new chunk product $f_i \cdot g_j$, we check
the new exponent against the degree of the previous chunk, in order to determine
whether to make a new chunk in the product or add to the previous one. The
details of this approach are given in Algorithm 1.



Algorithm 1: Chunky Multiplication

Input: $f, g$ as in (2.1) and (2.3)
Output: The product $f \cdot g = h$ in the chunky representation

 1  $\alpha \leftarrow f_1 \cdot g_1$ using dense multiplication
 2  $b \leftarrow e_1 + d_1$
 3  $H \leftarrow$ min-heap with pairs $(1, j)$ for $j = 2, 3, \ldots, s$, ordered by exponent sums
 4  if $t \ge 2$ then insert $(2, 1)$ into $H$
 5  while $H$ is not empty do
 6      $(i, j) \leftarrow$ pair removed from top of $H$
 7      $\beta \leftarrow f_i \cdot g_j$ using dense multiplication
 8      if $b + \deg \alpha < e_i + d_j$ then
 9          write $\alpha x^b$ as next term of $h$
10          $\alpha \leftarrow \beta$; $b \leftarrow e_i + d_j$
11      else $\alpha \leftarrow \alpha + \beta x^{e_i + d_j - b}$, stored as a dense polynomial
12      if $i < t$ then insert $(i + 1, j)$ into $H$
13  write $\alpha x^b$ as final term of $h$

After using this algorithm to multiply $f$ and $g$, we can easily convert the
result back to the dense or sparse representation in linear time. In fact, if the
output is dense, we can preallocate space for the result and store the computed
product directly in the dense array, requiring only some extra space for the heap
$H$ and a single intermediate product $h_{\mathrm{new}}$.
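The control flow of Algorithm 1 can be sketched in Python with the standard `heapq` module. This is an illustrative sketch rather than the authors' implementation: chunks are assumed to be (exponent, coefficient list) pairs sorted by exponent, indices are 0-based, and the school-method helper `dense_mul` is a stand-in for whatever dense routine realizes $M(n)$.

```python
import heapq


def dense_mul(a, b):
    """School-method dense product (placeholder for any M(n) routine)."""
    h = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            h[i + j] += x * y
    return h


def chunky_mul(f, g):
    """f, g: nonempty-chunk lists of (exponent, coeffs), sorted by exponent.
    Returns the product in the same format."""
    if not f or not g:
        return []
    if len(f) < len(g):
        f, g = g, f                               # ensure t >= s
    t, s = len(f), len(g)
    alpha = dense_mul(f[0][1], g[0][1])           # current output chunk
    b = f[0][0] + g[0][0]                         # its exponent
    heap = [(f[0][0] + g[j][0], 0, j) for j in range(1, s)]
    if t >= 2:
        heap.append((f[1][0] + g[0][0], 1, 0))
    heapq.heapify(heap)
    h = []
    while heap:
        e, i, j = heapq.heappop(heap)             # smallest exponent sum
        beta = dense_mul(f[i][1], g[j][1])
        if b + len(alpha) - 1 < e:                # no overlap: start a new chunk
            h.append((b, alpha))
            alpha, b = beta, e
        else:                                     # overlap: add beta * x^(e - b)
            off = e - b
            alpha.extend([0] * (off + len(beta) - len(alpha)))
            for k, c in enumerate(beta):
                alpha[off + k] += c
        if i + 1 < t:
            heapq.heappush(heap, (f[i + 1][0] + g[j][0], i + 1, j))
    h.append((b, alpha))
    return h
```

The heap never holds more than $s$ entries, one per chunk of $g$, which is what keeps the word-operation overhead at a $\log s$ factor per chunk product. The sketch does not renormalize output chunks whose leading or trailing coefficients cancel to zero.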

Theorem 2.1. Algorithm 1 correctly computes the product of $f$ and $g$ using

    $O\left( \sum_{\substack{\deg f_i \ge \deg g_j \\ 1 \le i \le t,\ 1 \le j \le s}} (\deg f_i) \cdot \delta(\deg g_j) + \sum_{\substack{\deg f_i < \deg g_j \\ 1 \le i \le t,\ 1 \le j \le s}} (\deg g_j) \cdot \delta(\deg f_i) \right)$

ring operations and $O(ts \cdot \log s \cdot \log(\deg fg))$ word operations.

Proof. Correctness is clear from the definitions. The bound on ring operations
comes from Step 7, using the fact that $\delta(n) = M(n)/n$ together with the
cost $O(n \cdot \delta(m))$ of an unbalanced product from Section 1.1. The cost
of additions on Step 11 is linear and hence also within the stated bound.

The cost of word operations is incurred in removing from and inserting to
the heap on Steps 6 and 12. Because these steps are executed no more than
$ts$ times, the size of the heap is never more than $s$, and each exponent sum
is bounded by the degree of the product, the stated bound is correct. □

Notice that the cost of word operations is always less than the cost would be
if we had multiplied $f$ and $g$ in the standard sparse representation. We
therefore focus only on minimizing the number of ring operations in the
conversion steps that follow.

2.2 Conversion given optimal chunk size 

The general chunky conversion problem is, given $f, g \in R[x]$, both either in
the sparse or dense representation, to determine chunky representations for $f$
and $g$ which minimize the cost of Algorithm 1. Here we consider a simpler
problem, namely determining an optimal chunky representation for $f$ given that
$g$ has only one chunk of size $k$.

The following corollary comes directly from Theorem 2.1 and will guide our 
conversion algorithm on this step. 

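Corollary 2.2. Given $f \in R[x]$ as in (2.1), the number of ring operations
required to multiply $f$ by a single dense polynomial with degree less than $k$ is

    $O\left( \delta(k) \sum_{\deg f_i \ge k} \deg f_i + k \sum_{\deg f_i < k} \delta(\deg f_i) \right)$.
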
For any high-degree chunk (i.e. $\deg f_i \ge k$), we see that there is no
benefit to making the chunk any larger, as the cost is proportional to the sum
of the degrees of these chunks. In order to minimize the cost of multiplication,
then, we should not have any chunks with degree greater than $k$ (except
possibly in the case that every coefficient of the chunk is nonzero), and we
should minimize $\sum \delta(\deg f_i)$ for all chunks with size less than $k$.

These observations form the basis of our approach in Algorithm 2 below. 
For an input polynomial $f \in R[x]$, each "gap" of consecutive zero coefficients
in $f$ is examined, in order. We determine the optimal chunky conversion if the
polynomial were truncated at that gap. This is accomplished by finding the 
previous gap of highest degree that should be included in the optimal chunky 
representation. We already have the conversion for the polynomial up to that 
gap (from a previous step) , so we simply add on the last chunk and we are done. 
At the end, after all gaps have been examined, we have the optimal conversion 
for the entire polynomial. 

Let $a_i, b_i \in \mathbb{Z}$ for $0 \le i \le m$ be the sizes of each
consecutive "gap" of zero coefficients and "block" of nonzero coefficients, in
order. Each $a_i$ and $b_i$ will be nonzero except possibly for $a_0$ (if $f$
has a nonzero constant coefficient), and
$\sum_{0 \le i \le m} (a_i + b_i) = \deg f + 1$. For example, the polynomial

    $f = 5x^{10} + 3x^{11} + 9x^{13} + 20x^{19} + 4x^{20} + 8x^{21}$

has $a_0 = 10$, $b_0 = 2$, $a_1 = 1$, $b_1 = 1$, $a_2 = 5$, and $b_2 = 3$. Also
define $d_i$ to be the degree of the polynomial up to (not including) gap $i$,
i.e. $d_i = \sum_{0 \le j < i} (a_j + b_j)$.
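The gap and block sizes can be read off from the sorted exponents of the nonzero terms in a single pass. The following Python sketch is a hypothetical helper (the name `gaps_and_blocks` and the list-based output are assumptions of this illustration) that returns the lists of $a_i$, $b_i$ and the prefix sums $d_i$ for $0 \le i \le m + 1$.

```python
def gaps_and_blocks(exponents):
    """exponents: strictly increasing exponents of the nonzero terms of f.
    Returns (a, b, d): gap sizes, block sizes, and prefix sums
    d[i] = sum of a[j] + b[j] over j < i."""
    a, b = [], []
    prev = -1                          # exponent of the previous nonzero term
    for e in exponents:
        if b and e == prev + 1:
            b[-1] += 1                 # e extends the current block
        else:
            a.append(e - prev - 1)     # zeros skipped before a new block
            b.append(1)
        prev = e
    d = [0]
    for ai, bi in zip(a, b):
        d.append(d[-1] + ai + bi)
    return a, b, d
```

For the exponents 10, 11, 13, 19, 20, 21 this yields gaps 10, 1, 5 and blocks 2, 1, 3, and the final prefix sum is 22, which is the degree plus one.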

For the gap at index $\ell$, for $1 \le \ell \le m$, we store the optimal chunky
conversion of $f \bmod x^{d_\ell}$ by a linked list $L_\ell$ of indices of all
gaps in $f$ that should also be gaps between chunks in the optimal chunky
representation. In $c_\ell$ we also store $1/k$ times the cost, in ring
operations, of multiplying $f \bmod x^{d_\ell}$ (in this optimal
representation) by a single chunk of size $k$.

When examining the gap at index $\ell$, in order to determine the previous
gap of highest degree to be included in the optimal chunky representation if
the polynomial were truncated at gap $\ell$, we need to find the index
$i < \ell$ that minimizes $c_i + \delta(d_\ell - d_i)$ (indices $i$ where
$d_\ell - d_i > k$ need not be considered, as discussed above). From (2.2), we
know that, if $1 \le i < j < \ell$ and
$c_i + \delta(d_\ell - d_i) < c_j + \delta(d_\ell - d_j)$, then this same
inequality continues to hold as $\ell$ increases. That is, as soon as an earlier
gap results in a smaller cost than a later one, that earlier gap will continue
to beat the later one.

Thus we can essentially precompute the values of
$\min_{i < \ell} (c_i + \delta(d_\ell - d_i))$ by maintaining a stack of
index-index pairs. A pair $(i, j)$ of indices indicates that
$c_i + \delta(d_\ell - d_i)$ is minimal as long as $\ell \le j$. The second pair
of indices indicates the minimal value from gap $j$ to the gap of the second
index of the second pair, and so forth up to the bottom of the stack and the
last gap.

The details of this rather complicated algorithm are given in Algorithm 2. 

For an informal justification of correctness, consider a single iteration through
the main for loop. At this point, we have computed all optimal costs
$c_1, c_2, \ldots, c_{\ell-1}$, and the lists of gaps to achieve those costs
$L_1, L_2, \ldots, L_{\ell-1}$. We also have computed the stack $S$, indicating
which of the gaps up to index $\ell - 2$ is optimal and when.

The while loop on Step 3 removes all gaps from the stack which are no
longer relevant, either because their cost is now beaten by a previous gap (when
$j < \ell$), or because the size of the resulting chunk would be greater than
$k$ and therefore unnecessary to consider.

If the condition of Step 5 is true, then there is no index at which gap
$(\ell - 1)$ should be used, so we discard it.

Otherwise, the gap at index $\ell - 1$ is good at least some of the time, so we
proceed to the task of determining the largest gap index $v$ at which gap
$(\ell - 1)$ might still be useful. First, in Steps 10-12, we repeatedly check
whether gap $(\ell - 1)$ always beats the gap at the top of the stack $S$, and
if so remove it. After this process, either no gaps remain on the stack, or we
have a range $r \le v < j$ in which binary search can be performed to determine
$v$.

From the definitions, $d_{m+1} = \deg f + 1$, and so the list of gaps $L_{m+1}$
returned on the final step gives the optimal list of gaps to include in
$f \bmod x^{d_{m+1}}$, which is of course just $f$ itself.






Algorithm 2: Chunky Conversion Algorithm

Input: $k \in \mathbb{N}$, $f \in R[x]$, and integers $a_i, b_i, d_i$ for $i = 0, 1, 2, \ldots, m$ as above
Output: A list $L$ of the indices of gaps to include in the optimal chunky
        representation of $f$ when multiplying by a single chunk of size $k$

 1  $L_1 \leftarrow \emptyset$; $c_1 \leftarrow \delta(b_0)$; $S \leftarrow (0, m+1)$
 2  for $\ell = 2, 3, \ldots, m+1$ do
 3      while top pair $(i, j)$ from $S$ satisfies $j < \ell$ or $d_\ell - d_i > k$ do
 4          Remove $(i, j)$ from $S$
 5      if top pair $(i, j)$ from $S$ satisfies $c_i + \delta(d_\ell - d_i) \le c_{\ell-1} + \delta(d_\ell - d_{\ell-1})$ then
            …
        else
            …
10          while top pair $(i, j)$ from $S$ satisfies $c_i + \delta(d_j - d_i) > c_{\ell-1} + \delta(d_j - d_{\ell-1})$ do
11              $r \leftarrow j$
12              Remove $(i, j)$ from $S$
13          if $S$ is empty then
14              $S \leftarrow (\ell - 1, m + 1)$
15          else
16              $(i, j) \leftarrow$ top pair from $S$
17              $v \leftarrow$ least index with $r \le v < j$ s.t. $c_{\ell-1} + \delta(d_v - d_{\ell-1}) > c_i + \delta(d_v - d_i)$
18              $S \leftarrow (\ell - 1, v), S$
19      $c_\ell \leftarrow c_i + \delta(d_\ell - d_i)$ (where $(i, j)$ is top pair from $S$)
20  return $L_{m+1}$



Theorem 2.3. Algorithm 2 returns the optimal chunky representation for
multiplying $f$ by a dense size-$k$ chunk. The running time of the algorithm is
linear in the size of the input representation of $f$.

Proof. Correctness follows from the discussions above. 

For the complexity analysis, first note that the maximal size of $S$, as well
as the number of saved values $a_i, b_i, d_i, c_i, L_i$, is $m$, the number of
gaps in $f$. Clearly $m$ is less than the number of nonzero terms in $f$, so
this is bounded above by the sparse or dense representation size. If the lists
$L_i$ are implemented as singly-linked lists, sharing nodes, then the total
extra storage for the algorithm is $O(m)$.

The total number of iterations of the two while loops corresponds to the
number of gaps that are removed from the stack $S$ at any step. Since at most
one gap is pushed onto $S$ at each step, the total number of removals, and hence
the total cost of these while loops over all iterations, is $O(m)$.

Now consider the cost of Step 17 at each iteration. If the input is given
in the sparse representation, we just perform a binary search on the interval
from $r$ to $j$, for a total cost of $O(m \log m)$ over all iterations. Because
$m$ is at most the number of nonzero terms in $f$, $m \log m$ is bounded above
by the sparse representation size, so the theorem is satisfied for sparse input.

When the input is given in the dense representation, we also use a binary
search for Step 17, but we start with a one-sided binary search, or "galloping"
search, from either $r$ or $j$, depending on which $v$ is closer to. The cost of
this search at a single iteration is $O(\log \min\{v - r, j - v\})$. Notice that
the interval $(r, j)$ in the stack is then effectively split at the index $v$,
so intuitively whenever more work is required through one iteration of this
step, the size of intervals is reduced, so future iterations should have lower
cost.

More precisely, a loose upper bound in the worst case of the total cost over all
iterations is $O\left(\sum_{i=1}^{u} 2^i \cdot (u - i + 1)\right)$, where
$u = \lceil \log_2 m \rceil$. This is less than $2^{u+2}$, which is $O(m)$,
giving linear cost in the size of the dense representation. □
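To make the optimization target concrete, here is a deliberately simple quadratic-time dynamic program in Python. It is not the linear-time stack-based method of Algorithm 2; it only computes the same kind of optimum by brute force over all earlier gaps. The per-chunk cost `max(n, k) * delta(min(n, k))` is an assumption of this sketch: it is the summand of Theorem 2.1 with chunk sizes in place of degrees. The function names are hypothetical.

```python
def chunk_cost(n, k, delta):
    """Estimated ring operations for a size-n chunk times a size-k chunk
    (assumed model: summand of Theorem 2.1 with sizes instead of degrees)."""
    return max(n, k) * delta(min(n, k))


def optimal_chunks_quadratic(a, b, k, delta):
    """a, b: gap and block sizes of f. Returns (cost, chunks), where chunks is
    a list of (first exponent, size) minimizing the total chunk_cost."""
    nblocks = len(b)
    d = [0]
    for ai, bi in zip(a, b):
        d.append(d[-1] + ai + bi)
    best = [0.0] + [float("inf")] * nblocks   # best[j]: optimal cost of blocks 0..j-1
    prev = [0] * (nblocks + 1)                # prev[j]: first block of the last chunk
    for end in range(1, nblocks + 1):
        for start in range(end):
            size = d[end] - d[start] - a[start]   # chunk covering blocks start..end-1
            c = best[start] + chunk_cost(size, k, delta)
            if c < best[end]:
                best[end], prev[end] = c, start
    chunks, end = [], nblocks
    while end > 0:                            # walk back through the choices
        start = prev[end]
        chunks.append((d[start] + a[start], d[end] - d[start] - a[start]))
        end = start
    chunks.reverse()
    return best[nblocks], chunks
```

With gaps 10, 1, 5, blocks 2, 1, 3 and $k = 4$, a school-method cost $\delta(n) = n$ keeps all three blocks separate, whereas a Karatsuba-like $\delta(n) = n^{0.585}$ merges the first two blocks into one chunk of size 4. This shows how the choice of $\delta$ steers the conversion between the sparse and dense extremes.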

2.3 Determining the optimal chunk size 

All that remains is to compute the optimal chunk size $k$ that will be used in
the conversion algorithm from the previous section. This is accomplished by
finding the value of $k$ that minimizes the cost of multiplying two polynomials
$f, g \in R[x]$, under the restriction that every chunk of $f$ and of $g$ has
size $k$.

If $f$ is written in the chunky representation as in (2.1), there are many
possible choices for the number of chunks $t$, depending on how large the chunks
are. So define $t(k)$ to be the least number of chunks if each chunk has size at
most $k$, i.e. $\deg f_i < k$ for $1 \le i \le t(k)$. Similarly define $s(k)$
for $g \in R[x]$ written as in (2.3).

Therefore, from the cost of multiplication in Theorem 2.1, in this part we 
want to compute the value of k that minimizes 

    $t(k) \cdot s(k) \cdot k \cdot \delta(k)$.    (2.4)

Say $\deg f = n$. After $O(n)$ preprocessing work (making pointers to the
beginning and end of each "gap"), $t(k)$ could be computed using $O(n/k)$ word
operations, for any value $k$. This leads to one possible approach to computing
the value of $k$ that minimizes (2.4) above: simply compute (2.4) for each
possible $k = 1, 2, \ldots, \max\{\deg f, \deg g\}$. This naive approach is too
costly for our purposes, but underlies the basic idea of our algorithm.
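The naive approach can be written down directly. In this Python sketch, `least_chunks` computes $t(k)$ greedily from the sorted exponents of the nonzero terms (opening a new chunk only when a term does not fit in the current one), and `naive_best_k` evaluates $t(k) \cdot s(k) \cdot k \cdot \delta(k)$ for every candidate $k$. Both names are hypothetical, and the sketch scans the exponent list rather than using the $O(n/k)$ pointer structure mentioned above. As an assumption of this illustration, $k$ ranges up to the maximum degree plus one, so that a single chunk can hold an entire input.

```python
def least_chunks(exponents, k):
    """t(k): fewest chunks of size at most k covering all nonzero terms."""
    count, limit = 0, -1
    for e in exponents:
        if e > limit:              # e does not fit in the current chunk
            count += 1
            limit = e + k - 1      # last exponent the new chunk can hold
    return count


def naive_best_k(f_exps, g_exps, delta):
    """Try every chunk size and return (k, cost) minimizing
    t(k) * s(k) * k * delta(k). Inputs are nonempty sorted exponent lists."""
    k_max = max(f_exps[-1], g_exps[-1]) + 1
    best_k, best_cost = 1, float("inf")
    for k in range(1, k_max + 1):
        cost = least_chunks(f_exps, k) * least_chunks(g_exps, k) * k * delta(k)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost
```

With a Karatsuba-like $\delta(n) = n^{0.585}$, two fully dense inputs of degree 3 give $k = 4$ (one dense product), while two inputs with exponents 0 and 100 give $k = 1$ (the sparse method).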

Rather than explicitly computing each $t(k)$ and $s(k)$, we essentially maintain
chunky representations of $f$ and $g$ with all chunks having size less than $k$,
starting with $k = 1$. As $k$ increases, we count the number of chunks in each
representation, which gives a tight approximation to the actual values of $t(k)$
and $s(k)$, while achieving linear complexity in the size of either the sparse
or dense representation.

To facilitate the "update" step, a minimum priority queue $Q$ (whose specific
implementation depends on the input polynomial representation) is maintained
containing all gaps in the current chunky representations of $f$ and $g$. For
each gap, the key value (on which the priority queue is ordered) is the size of
the chunk that would result from merging the two chunks adjacent to the gap into
a single chunk.

So for example, if we write $f$ in the chunky representation as

    $f = (4 + 0x + 5x^2) \cdot x^{12} + (7 + 0x + 0x^2 + 0x^3 + 8x^4) \cdot x^{50}$,

then the single gap in $f$ will have key value $3 + 35 + 5 = 43$. More
precisely, if $f$ is written as in (2.1), then the $i$th gap has key value

    $\deg f_{i+1} + e_{i+1} - e_i + 1$.    (2.5)
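The key value is simply the size of the chunk obtained by merging the two neighbours of a gap. A short Python sketch, with chunks assumed to be (exponent, coefficient list) pairs and a hypothetical function name, computes the key $\deg f_{i+1} + e_{i+1} - e_i + 1$ for every gap. The exponents 12 and 50 in the usage below are assumed values chosen so that the two chunks of sizes 3 and 5 are separated by 35 zeros.

```python
def gap_keys(chunks):
    """chunks: list of (e_i, coeffs_i) sorted by e_i.
    Returns deg f_{i+1} + e_{i+1} - e_i + 1 for each gap i."""
    return [
        (len(c_next) - 1) + e_next - e_cur + 1
        for (e_cur, _), (e_next, c_next) in zip(chunks, chunks[1:])
    ]


# Chunks of sizes 3 and 5 separated by 35 zeros: key 3 + 35 + 5
keys = gap_keys([(12, [4, 0, 5]), (50, [7, 0, 0, 0, 8])])
```

Two adjacent single-term chunks give the smallest possible key, 2, which is the lower end of the key range used for the array-based queue on dense input.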

Each gap in the priority queue also contains pointers to the two (or fewer) 
neighboring gaps in the current chunky representation. Removing a gap from 
the queue corresponds to merging the two chunks adjacent to that gap, so we 
will need to update (by increasing) the key values of any neighboring gaps 
accordingly. 

At each iteration through the main loop in the algorithm, the smallest key 
value in the priority queue is examined, and k is increased to this value. Then 
gaps with key value k are repeatedly removed from the queue until no more 
remain. This means that each remaining gap, if removed, would result in a chunk 
of size strictly greater than $k$. Finally, we compute $\delta(k)$ and an
approximation of (2.4).

Since the purpose here is only to compute an optimal chunk size $k$, and
not actually to compute chunky representations of $f$ and $g$, we do not have to
maintain chunky representations of the polynomials as the algorithm proceeds,
but merely counters for the number of chunks in each one. Algorithm 3 gives
the details of this computation.

All that remains is the specification of the data structures used to implement 
the priority queues $Q_f$ and $Q_g$. If the input polynomials are in the sparse
representation, we simply use standard binary heaps, which give logarithmic 
cost for each removal and update. Because the exponents in this case are multi- 
precision integers, we might imagine encountering chunk sizes that are larger 
than the largest word-sized integer. But as discussed previously, such a chunk 
size would be meaningless since a dense polynomial with that size cannot be 
represented in memory. So our priority queues may discard any gaps whose 
key value is larger than word-sized. This guarantees all keys in the queues are 
word-size integers, which is necessary for the complexity analysis later. 

If the input polynomials are dense, we need a structure which can perform
removals and updates in constant time, using $O(\deg f + \deg g)$ time and
space. For $Q_f$, we use an array with length $\deg f$ of (possibly empty)
linked lists, where the list at index $i$ in the array contains all elements in
the queue with key $i$. (An array of this length is sufficient because each key
value in $Q_f$ is at least 2 and at most $1 + \deg f$.) We use the same data
structure for $Q_g$, and this clearly gives constant time for each remove and
update operation.

Algorithm 3: Optimal Chunk Size Computation

Input: $f, g \in R[x]$
Output: $k \in \mathbb{N}$ that minimizes $t(k) \cdot s(k) \cdot k \cdot \delta(k)$

 1  $Q_f, Q_g \leftarrow$ minimum priority queues initialized with all gaps in $f$ and $g$, respectively
 2  $k, k_{\min} \leftarrow 1$; $c_{\min} \leftarrow$ …
 3  while $Q_f$ and $Q_g$ are not both empty do
 4      $k \leftarrow$ smallest key value from $Q_f$ or $Q_g$
 5      while $Q_f$ has an element with key value $\le k$ do
 6          Remove a $k$-valued gap from $Q_f$ and update neighbors
 7      while $Q_g$ has an element with key value $\le k$ do
 8          Remove a $k$-valued gap from $Q_g$ and update neighbors
 9      $c_{\mathrm{current}} \leftarrow (|Q_f| + 1) \cdot (|Q_g| + 1) \cdot k \cdot \delta(k)$
10      if $c_{\mathrm{current}} < c_{\min}$ then
11          $k_{\min} \leftarrow k$; $c_{\min} \leftarrow c_{\mathrm{current}}$
12  return $k_{\min}$


To find the smallest key value in either queue at each iteration through
Step 4, we simply start at the beginning of the array and search forward in each
position until a non-empty list is found. Because each queue element update
only results in the key values increasing, we can start the search at each iteration
at the point where the previous search ended. Hence the total cost of Step 4 for
all iterations is $O(\deg f + \deg g)$.
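The merging process behind Algorithm 3 can be sketched in Python without the specialized priority queues. This simplified version rescans the chunk list on every merge, so it runs in quadratic rather than linear time; it is meant only to show how the chunk counts evolve as $k$ grows. Chunks are assumed to be inclusive exponent ranges `[start, end]`, the initial cost is assumed to be that of single-term chunks with $k = 1$, and all names are hypothetical.

```python
def smallest_key(chunks):
    """Smallest merged size over all gaps, or None if there is no gap."""
    keys = [chunks[i + 1][1] - chunks[i][0] + 1 for i in range(len(chunks) - 1)]
    return min(keys) if keys else None


def merge_up_to(chunks, k):
    """Merge neighbouring chunks while some merged chunk has size <= k."""
    merged = True
    while merged:
        merged = False
        for i in range(len(chunks) - 1):
            if chunks[i + 1][1] - chunks[i][0] + 1 <= k:
                chunks[i][1] = chunks[i + 1][1]   # absorb the right neighbour
                del chunks[i + 1]
                merged = True
                break


def approx_best_chunk_size(f_exps, g_exps, delta):
    """Return the chunk size k with the smallest estimated cost
    (#chunks of f) * (#chunks of g) * k * delta(k)."""
    F = [[e, e] for e in f_exps]                  # start from single-term chunks
    G = [[e, e] for e in g_exps]
    k_min, c_min = 1, len(F) * len(G) * delta(1)  # assumed initial cost (k = 1)
    while len(F) > 1 or len(G) > 1:
        k = min(x for x in (smallest_key(F), smallest_key(G)) if x is not None)
        merge_up_to(F, k)
        merge_up_to(G, k)
        c = len(F) * len(G) * k * delta(k)
        if c < c_min:
            k_min, c_min = k, c
    return k_min
```

Because each pass jumps straight to the smallest remaining key, only chunk sizes at which the chunk count actually changes are evaluated, which is the idea that makes a linear-time implementation possible.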

The following lemma proves that our approximations of $t(k)$ and $s(k)$ are
reasonably tight, and will be crucial in proving the correctness of the algorithm.

Lemma 2.4. At any iteration through Step 10 in Algorithm 3, $|Q_f| < 2t(k)$
and $|Q_g| < 2s(k)$.

Proof. First consider $f$. There are two chunky representations with each chunk
of degree less than $k$ to consider: the optimal having $t(k)$ chunks and the one
implicitly computed by Algorithm 3 with $|Q_f| + 1$ chunks. Call these $f^{*}$
and $\bar{f}$, respectively.

We claim that no single chunk of the optimal $f^{*}$ contains three or more
constant terms of chunks in the implicitly-computed $\bar{f}$. If this were not
so, then two chunks in $\bar{f}$ could be combined to result in a single chunk
with degree less than $k$. But this is impossible, since all such pairs of
chunks would already have been merged after the completion of Step 5.

Therefore every chunk in $f^{*}$ contains at most two constant terms of distinct
chunks in $\bar{f}$. Since each constant term of a chunk is required to be
nonzero, the number of chunks in $\bar{f}$ is at most twice the number in
$f^{*}$. Hence $|Q_f| + 1 \le 2t(k)$. An identical argument for $g$ gives the
stated result. □
Now we are ready for the main result of this subsection.

Theorem 2.5. Algorithm 3 computes a chunk size $k$ such that
$t(k) \cdot s(k) \cdot k \cdot \delta(k)$ is at most 4 times the minimum value.
The worst-case cost of the algorithm is linear in the size of the input
representations.

Proof. Write $c_f(k) = |Q_f| + 1$ and $c_g(k) = |Q_g| + 1$ for the chunk counts
used by the algorithm at chunk size $k$. If $k$ is the value returned from the
algorithm and $k^{*}$ is the value which actually minimizes (2.4), the worst
that can happen is that the algorithm computes the actual value of
$c_f(k)\, c_g(k)\, k\, \delta(k)$, but overestimates the value of
$c_f(k^{*})\, c_g(k^{*})\, k^{*}\, \delta(k^{*})$. This overestimation can only
occur in $c_f(k^{*})$ and $c_g(k^{*})$, and each of those by only a factor of 2
from Lemma 2.4. So the first statement of the theorem holds.

Write $c$ for the total number of nonzero terms in $f$ and $g$. The initial
sizes of the queues $Q_f$ and $Q_g$ are $O(c)$. Since gaps are only removed from
the queues (after they are initialized), the total number of queue operations is
bounded above by $O(c)$, which in turn is bounded above by the sparse and dense
sizes of the input polynomials.

If the input is sparse and we use a binary heap, the cost of each queue
operation is $O(\log c)$, for a total cost of $O(c \log c)$, which is a lower
bound on the size of the sparse representations. If the input is in the dense
representation, then each queue operation has constant cost. Since
$c \in O(\deg f + \deg g)$, the total cost is linear in the size of the dense
representation. □

2.4 Chunky Multiplication Overview 

Now we are ready to examine the whole process of chunky polynomial conversion 
and multiplication. First we need the following easy corollary of Theorem 2.3. 

Corollary 2.6. Let $f \in R[x]$, $k \in \mathbb{N}$, and $\bar{f}$ be any chunky
representation of $f$ where all chunks have degree less than $k$, and $\hat{f}$
be the representation returned by Algorithm 2 on input $k$. The cost of
multiplying $\hat{f}$ by a single chunk of size $\ell \le k$ is then less than
the cost of multiplying $\bar{f}$ by the same chunk.

Proof. Consider the result of Algorithm 2 on input $\ell$. We know from
Theorem 2.3 that this gives the optimal chunky representation for multiplication
of $f$ with a size-$\ell$ chunk. But the only difference in the algorithm on
input $\ell$ and input $k$ is that more pairs are removed at each iteration on
Step 3 on input $\ell$.

This means that every gap included in the representation $\hat{f}$ is also
included in the optimal representation. We also know that all chunks in
$\bar{f}$ have degree less than $k$, so that $\bar{f}$ must have fewer gaps that
are in the optimal representation than $\hat{f}$. It follows that multiplication
of a size-$\ell$ chunk by $\hat{f}$ is more efficient than multiplication by
$\bar{f}$. □

To review, the entire process to multiply $f, g \in R[x]$ using the chunky
representation is as follows:

1. Compute $k$ from Algorithm 3.

2. Compute chunky representations of $f$ and $g$ using Algorithm 2 with input
$k$.

3. Multiply the two chunky representations using Algorithm 1.

4. Convert the chunky result back to the original representation.

Because each step is optimal (or within a constant bound of the optimal), 
we expect this approach to yield the most efficient chunky multiplication of $f$
and $g$. In any case, we know it will be at least as efficient as the standard sparse
or dense algorithm. 

Theorem 2.7. Computing the product of $f, g \in R[x]$ with the chunky method
described above never uses more ring operations than either the standard sparse
or dense polynomial multiplication algorithms.

Proof. In Algorithm 3, the values of $t(k) \cdot s(k) \cdot k \cdot \delta(k)$
for $k = 1$ and $k = \min\{\deg f, \deg g\}$ correspond to the costs of the
standard sparse and dense algorithms, respectively. Furthermore, it is easy to
see that these values are never overestimated, meaning that the $k$ returned
from the algorithm which minimizes this formula gives a cost which is not
greater than the cost of either standard algorithm.

Now call $\bar{f}$ and $\bar{g}$ the implicit representations from Algorithm 3,
and $\hat{f}$ and $\hat{g}$ the representations returned from Algorithm 2 on
input $k$. We know that the multiplication of $\bar{f}$ by $\bar{g}$ is more
efficient than either standard algorithm from above. Since every chunk in
$\bar{g}$ has size $k$, multiplying $\hat{f}$ by $\bar{g}$ will have an even
lower cost, from Theorem 2.3. Finally, since every chunk in $\hat{f}$ has size
at most $k$, Corollary 2.6 tells us that the cost is further reduced by
multiplying $\hat{f}$ by $\hat{g}$.

The proof is complete from the fact that conversion back to either original
representation takes linear time in the size of the output. □

3 Equal-Spaced Polynomials

Next we consider an adaptive representation which is in some sense orthogo- 
nal to the chunky representation. This representation will be useful when the 
coefficients of the polynomial are not grouped together into dense chunks, but 
rather when they are spaced evenly apart. 

Let f ∈ R[x] with degree n, and suppose the exponents of f are all divisible
by some integer k. Then we can write f = a_0 + a_1 x^k + a_2 x^{2k} + ···. So by
letting f_D = a_0 + a_1 x + a_2 x^2 + ···, we have f = f_D ∘ x^k (where the symbol
∘ indicates functional composition).
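
For a concrete instance (the coefficients are chosen here only for
illustration), take k = 4:

```latex
f = 3 + 5x^{4} + 2x^{8} - x^{12}, \qquad
f_D = 3 + 5x + 2x^{2} - x^{3}, \qquad
f = f_D \circ x^{4}.
```

The dense array for f_D has 4 entries, against 13 entries for the dense
array of f itself.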

One motivating example suggested by Michael Monagan is that of homogeneous
polynomials. Recall that a multivariate polynomial h ∈ R[x_1, ..., x_n] is
homogeneous of degree d if every nonzero term of h has total degree d. It is
well-known that the number of variables in a homogeneous polynomial can be
effectively reduced by one by writing y_i = x_i/x_n for 1 ≤ i < n and h = x_n^d · ĥ,
for ĥ ∈ R[y_1, ..., y_{n−1}] an (n − 1)-variate polynomial with max-degree d. This
leads to efficient schemes for homogeneous polynomial arithmetic.

But this is only possible if (1) the user realizes this structure in their
polynomials, and (2) every polynomial used is homogeneous. Otherwise, a more generic
approach will be used, such as the Kronecker substitution mentioned in the
introduction. Choosing some integer ℓ > d, we evaluate
h(y, y^ℓ, y^{ℓ^2}, ..., y^{ℓ^{n−1}}),
and then perform univariate arithmetic over R[y]. But if h is homogeneous,
a special structure arises: every exponent of y is of the form d + i(ℓ − 1) for
some integer i ≥ 0. Therefore we can write
h(y, ..., y^{ℓ^{n−1}}) = (h̄ ∘ y^{ℓ−1}) · y^d, for
some h̄ ∈ R[y] with much smaller degree. The algorithms presented in this
section will automatically recognize this structure and perform the corresponding
optimization to arithmetic.
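
The exponent structure described above can be checked on a small example. The
Python sketch below applies the substitution x_i → y^{ℓ^{i−1}} to the exponent
vectors of a homogeneous polynomial; the helper name `kronecker_exponents` and
the sample monomials are assumptions made for illustration.

```python
def kronecker_exponents(monomials, ell):
    """Map each exponent vector (a_1, ..., a_n) to the exponent of y obtained
    by substituting x_i -> y**(ell**(i-1))."""
    return [sum(a * ell**i for i, a in enumerate(mono)) for mono in monomials]


# Exponent vectors of a homogeneous polynomial of degree d = 3 in x1, x2, x3,
# e.g. x1^3 + x1*x2*x3 + x2^2*x3 + x3^3 (the coefficients are irrelevant here).
d = 3
ell = 4  # any integer ell > d
monomials = [(3, 0, 0), (1, 1, 1), (0, 2, 1), (0, 0, 3)]

exps = kronecker_exponents(monomials, ell)
# Every exponent should have the form d + i*(ell - 1) with i >= 0.
steps = [(e - d) // (ell - 1) for e in exps]
```

The substituted polynomial has degree 48, but after removing the shift y^d and
the spacing ℓ − 1 = 3 the remaining dense part has degree only 15.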

The key idea is equal-spaced representation, which corresponds to writing a
polynomial f ∈ R[x] as

f = (f_D ∘ x^k) · x^d + f_S, (3.1)

with k, d ∈ N, f_D ∈ R[x] dense with degree less than n/k − d, and f_S ∈ R[x]
sparse with degree less than n. The polynomial f_S is a "noise" polynomial which
contains the comparatively few terms in f whose exponents are not of the form
ik + d for some i ≥ 0.
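
To make the representation concrete, the following Python sketch expands a
tuple (f_D, k, d, f_S) back into a dense coefficient list. Storing f_D as a
list and f_S as a dictionary from exponents to coefficients is an assumption of
this sketch, as is the function name `expand_equal_spaced`.

```python
def expand_equal_spaced(f_D, k, d, f_S):
    """Return the dense coefficient list of (f_D o x^k) * x^d + f_S.

    f_D is a dense coefficient list and f_S a dict {exponent: coefficient}
    holding the "noise" terms.
    """
    top = d + k * (len(f_D) - 1) if f_D else -1
    top = max([top] + list(f_S))
    out = [0] * (top + 1)
    for i, c in enumerate(f_D):
        out[i * k + d] += c      # coefficient i of f_D lands on exponent ik + d
    for e, c in f_S.items():
        out[e] += c              # noise terms keep their own exponents
    return out


# f = x^2 + 4x^5 + 7x^8 + 9x^11 plus one noise term 6x^6,
# stored with k = 3 and d = 2.
f = expand_equal_spaced([1, 4, 7, 9], k=3, d=2, f_S={6: 6})
```

Here a polynomial of degree 11 is described by a dense part of length 4 and a
single noise term.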

Unfortunately, converting a sparse polynomial to the best equal-spaced
representation seems to be difficult. To see why this is the case, consider the
much simpler problem of verifying that a sparse polynomial f can be written
as (f_D ∘ x^k) · x^d. For each exponent e_i of a nonzero term in f, this means
confirming that e_i ≡ d mod k. But the cost of computing each e_i mod k is roughly
O(Σ (log e_i) δ(log k)), which is a factor of δ(log k) greater than the size of the
input. Since k could be as large as the exponents, we see that even verifying a
proposed k and d takes too much time for the conversion step. Surely computing
such a k and d would be even more costly!

Therefore, for this subsection, we will always assume that the input polyno- 
mials are given in the dense representation. In Section 4, we will see how by 
combining with the chunky representation, we effectively handle equal-spaced 
sparse polynomials without ever having to convert a sparse polynomial directly 
to the equal-spaced representation. 

3.1 Multiplication in the equal-spaced representation 

Let g ∈ R[x] with degree less than m and write g = (g_D ∘ x^ℓ) · x^e + g_S as in
(3.1). To compute f · g, simply sum up the four pairwise products of terms. All
these except for the product (f_D ∘ x^k) · (g_D ∘ x^ℓ) are performed using standard
sparse multiplication methods.

Notice that if k = ℓ, then (f_D ∘ x^k) · (g_D ∘ x^ℓ) is simply (f_D · g_D) ∘ x^k, and
hence is efficiently computed using dense multiplication. However, if k and ℓ
are relatively prime, then almost any term in the product can be nonzero.

This indicates that the gcd of k and ℓ is very significant. Write r and s for
the greatest common divisor and least common multiple of k and ℓ, respectively.
To multiply (f_D ∘ x^k) by (g_D ∘ x^ℓ), we perform a transformation similar to the
process of finding common denominators in the addition of fractions. First split
f_D ∘ x^k into s/k (or ℓ/r) polynomials, each with degree less than n/s and right
composition factor x^s, as follows:

f_D ∘ x^k = (f_0 ∘ x^s) + (f_1 ∘ x^s) · x^k + (f_2 ∘ x^s) · x^{2k} + ··· + (f_{s/k−1} ∘ x^s) · x^{s−k}

Similarly split g_D ∘ x^ℓ into s/ℓ polynomials g_0, g_1, ..., g_{s/ℓ−1} with degrees less
than m/s and right composition factor x^s. Then compute all pairwise products
f_i · g_j, and combine them appropriately to compute the total sum (which will
be equal-spaced with right composition factor x^r).

Algorithm 4 gives the details of this method. 



Algorithm 4: Equal Spaced Multiplication

Input: f = (f_D ∘ x^k) · x^d + f_S, g = (g_D ∘ x^ℓ) · x^e + g_S,
with f_D = a_0 + a_1 x + a_2 x^2 + ···, g_D = b_0 + b_1 x + b_2 x^2 + ···
Output: The product f · g

1 r ← gcd(k, ℓ), s ← lcm(k, ℓ)

2 for i = 0, 1, ..., s/k − 1 do

3     f_i ← a_i + a_{s/k+i} x + a_{2s/k+i} x^2 + ···

4 for i = 0, 1, ..., s/ℓ − 1 do

5     g_i ← b_i + b_{s/ℓ+i} x + b_{2s/ℓ+i} x^2 + ···

6 h_D ← 0

7 for i = 0, 1, ..., s/k − 1 do

8     for j = 0, 1, ..., s/ℓ − 1 do

9         Compute f_i · g_j by dense multiplication

10        h_D ← h_D + ((f_i · g_j) ∘ x^s) · x^{ik+jℓ}

11 Compute (f_D ∘ x^k) · g_S, (g_D ∘ x^ℓ) · f_S, and f_S · g_S by sparse multiplication

12 return h_D · x^{e+d} + (f_D ∘ x^k) · g_S · x^d + (g_D ∘ x^ℓ) · f_S · x^e + f_S · g_S
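
The steps of Algorithm 4 can be sketched in Python as follows. This is an
illustrative sketch under several assumptions: coefficients are Python
integers, f_D and g_D are dense lists, f_S and g_S are dictionaries from
exponents to coefficients, the result is returned as such a dictionary, and a
schoolbook routine stands in for whichever dense multiplication is available.
The names `dense_mul` and `equal_spaced_mul` are hypothetical.

```python
from math import gcd


def dense_mul(a, b):
    """Schoolbook stand-in for any dense multiplication routine."""
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def equal_spaced_mul(f_D, k, d, f_S, g_D, ell, e, g_S):
    """Multiply (f_D o x^k) x^d + f_S by (g_D o x^ell) x^e + g_S.

    Returns the product as a dict mapping exponent -> nonzero coefficient.
    """
    s = k * ell // gcd(k, ell)                             # Step 1
    f_parts = [f_D[i::s // k] for i in range(s // k)]      # Steps 2-3
    g_parts = [g_D[j::s // ell] for j in range(s // ell)]  # Steps 4-5
    h = {}

    def add(exp, coeff):
        if coeff:
            h[exp] = h.get(exp, 0) + coeff

    # Steps 7-10: dense products, spread with spacing s and shifted by
    # i*k + j*ell, with the overall shift d + e of Step 12 applied directly.
    for i, fi in enumerate(f_parts):
        for j, gj in enumerate(g_parts):
            for t, c in enumerate(dense_mul(fi, gj)):
                add(t * s + i * k + j * ell + d + e, c)
    # Step 11: the three products involving noise terms, term by term.
    for eg, cg in g_S.items():
        for i, a in enumerate(f_D):
            add(i * k + d + eg, a * cg)
    for ef, cf in f_S.items():
        for j, b in enumerate(g_D):
            add(j * ell + e + ef, b * cf)
        for eg, cg in g_S.items():
            add(ef + eg, cf * cg)
    return {x: c for x, c in h.items() if c}


# (1 + x^2 + x^4) * (1 + x^3): k = 2, ell = 3, no shifts, no noise terms.
product = equal_spaced_mul([1, 1, 1], 2, 0, {}, [1, 1], 3, 0, {})
```

In this small example s = 6, so the dense parts are split into 3 and 2 pieces
and six products of constant polynomials are computed.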



As with chunky multiplication, this final product is easily converted to the 
standard dense representation in linear time. The following theorem gives the 
complexity analysis for equal-spaced multiplication. 

Theorem 3.1. Let f, g be as above such that n > m, and write t_f, t_g for the
number of nonzero terms in f_S and g_S, respectively. Then Algorithm 4 correctly
computes the product f · g using

O((n/r) · δ(m/s) + n t_g/k + m t_f/ℓ + t_f t_g)

ring operations.

Proof. Correctness follows from the preceding discussion. 

The polynomials f_D and g_D have at most n/k and m/ℓ nonzero terms,
respectively. So the cost of computing the three products in Step 11 by using
standard sparse multiplication is O(n t_g/k + m t_f/ℓ + t_f t_g) ring operations,
giving the last three terms in the complexity measure.

The initialization in Steps 2-5 and the additions in Steps 10 and 12 all have
cost bounded by O(n/r), and hence do not dominate the complexity.

All that remains is the cost of computing each product f_i · g_j by dense
multiplication on Step 9. From the discussion above, deg f_i < n/s and deg g_j < m/s,
for each i and j. Since n > m, (n/s) > (m/s), and therefore this product can
be computed using O((n/s) δ(m/s)) ring operations. The number of iterations
through Step 9 is exactly (s/k)(s/ℓ). But s/ℓ = k/r, so the number of iterations
is just s/r. Hence the total cost for this step is O((n/r) δ(m/s)), which gives
the first term in the complexity measure. □
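
As a small numerical check of the iteration count (the values are chosen here
only for illustration), take k = 4 and ℓ = 6:

```latex
r = \gcd(4, 6) = 2, \qquad s = \operatorname{lcm}(4, 6) = 12, \qquad
\frac{s}{k} \cdot \frac{s}{\ell} = 3 \cdot 2 = 6 = \frac{s}{r}.
```

Step 9 therefore performs six dense products, each on operands of degree less
than n/12 and m/12.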

It is worth noting that no additions of ring elements are actually performed
through each iteration of Step 10. The proof is as follows. If any additions were
performed, we would have

i_1 k + j_1 ℓ ≡ i_2 k + j_2 ℓ mod s

for distinct pairs (i_1, j_1) and (i_2, j_2). Without loss of generality, assume
i_1 ≠ i_2, and write

(i_1 k + j_1 ℓ) − (i_2 k + j_2 ℓ) = qs

for some q ∈ Z. Rearranging gives

(i_1 − i_2) k = (j_2 − j_1) ℓ + qs.

Because ℓ | s, the left hand side is a multiple of both k and ℓ, and therefore by
definition must be a multiple of s, their lcm. Since 0 ≤ i_1, i_2 < s/k, |i_1 − i_2| <
s/k, and therefore |(i_1 − i_2) k| < s. The only multiple of s with this property is
of course 0, and since k ≠ 0 this means that i_1 = i_2, a contradiction.
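
The argument can also be checked by brute force for small spacings. The Python
sketch below (the function name `shifts_mod_s` is hypothetical) lists the
residues of the shifts ik + jℓ modulo s and confirms that they are pairwise
distinct.

```python
from math import gcd


def shifts_mod_s(k, ell):
    """Residues (i*k + j*ell) mod s over 0 <= i < s/k and 0 <= j < s/ell."""
    s = k * ell // gcd(k, ell)
    return [(i * k + j * ell) % s
            for i in range(s // k) for j in range(s // ell)]


# If the residues are pairwise distinct, no two products f_i * g_j contribute
# to the same coefficient of h_D in Step 10.
for k in range(1, 30):
    for ell in range(1, 30):
        residues = shifts_mod_s(k, ell)
        assert len(residues) == len(set(residues))
```

For k = 4 and ℓ = 6 the six residues are exactly the multiples of r = 2 below
s = 12, in agreement with the right composition factor x^r of the sum.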

The following theorem compares the cost of equal-spaced multiplication to 
standard dense multiplication, and will be used to guide the approach to con- 
version below. 

Theorem 3.2. Let f, g, m, n, t_f, t_g be as before. Algorithm 4 does not use
asymptotically more ring operations than standard dense multiplication to
compute the product of f and g as long as t_f ∈ O(δ(n)) and t_g ∈ O(δ(m)).

Proof. Assuming again that n > m, the cost of standard dense multiplication
is O(n δ(m)) ring operations, which is the same as O(n δ(m) + m δ(n)).

Using the previous theorem, the number of ring operations used by
Algorithm 4 is

O((n/r) δ(m/s) + n δ(m)/k + m δ(n)/ℓ + δ(n) δ(m)).

Because all of k, ℓ, r, s are at least 1, and since δ(n) < n, every term in this
complexity measure is bounded by n δ(m) + m δ(n). The stated result follows. □

3.2 Converting to equal-spaced

The only question when converting a polynomial f to the equal-spaced
representation is how large we should allow t_S (the number of nonzero terms in
f_S) to be. From Theorem 3.2 above, clearly we need t_S ∈ O(δ(deg f)), but we can
see from the proof of the theorem that having this bound be tight will often
give performance that is equal to the standard dense method (not worse, but
not better either).

Let t be the number of nonzero terms in f. Since the goal of any adaptive
method is to in fact be faster than the standard algorithms, we use the lower
bound of δ(n) ∈ Ω(log n) and t ≤ deg f + 1 and require that t_S ≤ log_2 t.

As usual, let f ∈ R[x] with degree less than n and write

f = a_1 x^{e_1} + a_2 x^{e_2} + ··· + a_t x^{e_t},

with each a_i ∈ R \ {0}. The reader will recall that this corresponds to the sparse
representation of f, but keep in mind that we are assuming f is given in the
dense representation; f is written this way only for notational convenience.

The conversion problem is then to find the largest possible value of k such
that all but at most log_2 t of the exponents e_j can be written as ki + d, for any
nonnegative integer i and a fixed integer d. Our approach to computing k and
d will be simply to check each possible value of k, in decreasing order. To make
this efficient, we need a bound on the size of k.

Lemma 3.3. Let n ∈ N and e_1, ..., e_t be distinct integers in the range [0, n].
If at least t − log_2 t of the integers e_i are congruent to the same value modulo k,
for some k ∈ N, then

k ≤ n / (t − 2 log_2 t − 1).

Proof. Without loss of generality, order the e_i's so that 0 ≤ e_1 < e_2 < ··· <
e_t ≤ n. Now consider the telescoping sum (e_2 − e_1) + (e_3 − e_2) + ··· + (e_t − e_{t−1}).
Every term in the sum is at least 1, and the total is e_t − e_1, which is at most n.

Let S ⊆ {e_1, ..., e_t} be the set of at most log_2 t integers not congruent to the
others modulo k. Then for any e_i, e_j ∉ S, e_i ≡ e_j mod k. Therefore k | (e_j − e_i).
If j > i, this means that e_j − e_i ≥ k.

Returning to the telescoping sum above, each e_j ∈ S is in at most two of
the sum terms e_i − e_{i−1}. So all but at most 2 log_2 t of the terms are at least k.
Since there are exactly t − 1 terms, and the total sum is at most n, we conclude
that (t − 2 log_2 t − 1) · k ≤ n. The stated result follows. □

We now employ this lemma to develop an algorithm to determine the best
values of k and d, given a dense polynomial f. Starting from the largest possible
value from the bound, for each candidate value k, we compute each e_i mod k,
and find the majority element, that is, a common modular image of more
than half of the exponents.

To compute the majority element, we use a now well-known approach first
credited to Boyer and Moore [1981] and Fischer and Salzberg [1982]. Intuitively,
pairs of different elements are repeatedly removed until only one element
remains. If there is a majority element, this remaining element is it; only one
extra pass through the elements is required to check whether this is the case.
In practice, this is accomplished without actually modifying the list.
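
A minimal Python sketch of this voting scheme is given below. The function
names are hypothetical; a counter plays the role of the repeated removal of
pairs, so the input list is never modified.

```python
def majority_candidate(items):
    """Boyer-Moore voting pass: returns the only possible majority element."""
    candidate, count = None, 0
    for x in items:
        if count == 0:
            candidate, count = x, 1   # start a new candidate
        elif x == candidate:
            count += 1                # one more supporter
        else:
            count -= 1                # cancel one supporter against x
    return candidate


def majority(items):
    """Return the element occurring in more than half of `items`, or None."""
    items = list(items)
    cand = majority_candidate(items)
    # Second pass: confirm that the candidate really is a majority element.
    if items and 2 * sum(1 for x in items if x == cand) > len(items):
        return cand
    return None


exponents = [3, 8, 13, 14, 18, 23, 28]
residues = [e % 5 for e in exponents]
d = majority(residues)
```

In this example six of the seven exponents are congruent to 3 modulo 5, so the
vote returns 3 and only the exponent 14 would be treated as noise.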



Algorithm 5: Equal Spaced Conversion

Input: Exponents e_1, e_2, ..., e_t ∈ N and n ∈ N such that
0 ≤ e_1 < e_2 < ··· < e_t = n
Output: k, d ∈ N and S ⊆ {e_1, ..., e_t} such that e_i ≡ d mod k for all
exponents e_i not in S, and |S| ≤ log_2 t.

1 if t < 32 then k ← n

2 else k ← ⌊n/(t − 1 − 2 log_2 t)⌋

3 while k ≥ 2 do

4     d ← e_1 mod k; j ← 1

5     for i = 2, 3, ..., t do

6         if e_i ≡ d mod k then j ← j + 1

7         else if j > 0 then j ← j − 1

8         else d ← e_i mod k; j ← 1

9     S ← {e_i : e_i ≢ d mod k}

10    if |S| ≤ log_2 t then return k, d, S

11    k ← k − 1

12 return 1, 0, ∅



Given k, d, S from the algorithm, in one more pass through the input
polynomial, f_D and f_S are constructed such that f = (f_D ∘ x^k) · x^d + f_S. After
performing separate conversions for two polynomials f, g ∈ R[x], they are
multiplied using Algorithm 4.
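
The conversion can be sketched in Python as follows. The sketch assumes that
the exponents are passed as a sorted list of distinct nonnegative integers,
takes n to be the largest exponent, and stores f_S as a dictionary; the names
`equal_spaced_conversion` and `split` are hypothetical.

```python
from math import floor, log2


def equal_spaced_conversion(exps):
    """Sketch of Algorithm 5 for sorted distinct exponents e_1 < ... < e_t.

    Returns (k, d, S) with e % k == d for every exponent not in S and
    len(S) <= log2(t). Here n is taken to be the largest exponent.
    """
    t, n = len(exps), exps[-1]
    budget = log2(t)
    k = n if t < 32 else floor(n / (t - 1 - 2 * log2(t)))   # Steps 1-2
    while k >= 2:                                           # Step 3
        d, j = exps[0] % k, 1                               # Step 4
        for e in exps[1:]:                                  # Steps 5-8
            if e % k == d:
                j += 1
            elif j > 0:
                j -= 1
            else:
                d, j = e % k, 1
        S = [e for e in exps if e % k != d]                 # Step 9
        if len(S) <= budget:                                # Step 10
            return k, d, S
        k -= 1                                              # Step 11
    return 1, 0, []                                         # Step 12


def split(coeffs, k, d, S):
    """Build f_D (dense list) and f_S (dict) from a dense coefficient list."""
    f_S = {e: coeffs[e] for e in S}
    f_D = coeffs[d::k]   # noise exponents are never congruent to d modulo k
    return f_D, f_S


exps = [2, 7, 12, 13, 17, 22, 27, 32]
k, d, S = equal_spaced_conversion(exps)
```

With t = 8 the noise budget is log_2 8 = 3 terms, and the search starts at
k = 32 and decreases until a spacing with at most three exceptions is found.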

The following theorem proves correctness when t > 4. If t ≤ 4, we can always
trivially set k = e_t − e_1 and d = e_1 mod k to satisfy the stated conditions.

Theorem 3.4. Given integers e_1, ..., e_t and n, with t > 4, Algorithm 5
computes the largest integer k such that at least t − log_2 t of the integers e_i are
congruent modulo k, and uses O(n) word operations.

Proof. In a single iteration through the while loop, we compute the majority
element of the set {e_i mod k : i = 1, 2, ..., t}, if there is one. Because t > 4,
log_2 t < t/2. Therefore any element which occurs at least t − log_2 t times in a
t-element set is a majority element, which proves that any k returned by the
algorithm is such that at least t − log_2 t of the integers e_i are congruent modulo
k.

From Lemma 3.3, we know that the initial value of k on Step 1 or 2 is no smaller
than the optimal k value. Since we start at this value and decrement to 1, the
largest k satisfying the stated conditions is returned.

For the complexity analysis, first consider the cost of a single iteration
through the main while loop. Since each integer e_i is word-sized, computing
each e_i mod k has constant cost, and this happens O(t) times in each iteration.
If t < 32, each of the O(n) iterations has constant cost, for total cost O(n).

Otherwise, we start with k = ⌊n/(t − 1 − 2 log_2 t)⌋ and decrement. Because
t ≥ 32, t/2 > 1 + 2 log_2 t. Therefore (t − 1 − 2 log_2 t) > t/2, so the initial value of k
is less than 2n/t. This gives an upper bound on the number of iterations through
the while loop, and so the total cost is O(n) word operations, as required. □

Algorithm 5 can be implemented using only 0{t) space for the storage of 
the exponents ei, . . . , et, which is linear in the size of the output, plus the space 
required for the returned set S. 

4 Chunks with Equal Spacing 

The next question is whether the ideas of chunky and equal-spaced polynomial 
multiplication can be effectively combined into a single algorithm. As before, we 
seek an adaptive combination of previous algorithms, so that the combination 
is never asymptotically worse than either original idea. 

An obvious approach would be to first perform chunky polynomial conver- 
sion, and then equal-spaced conversion on each of the dense chunks. Unfortu- 
nately, this would be asymptotically less efficient than equal-spaced multiplica- 
tion alone in a family of instances, and therefore is not acceptable as a proper 
adaptive algorithm. 

The algorithm presented here does in fact perform chunky conversion first,
but instead of performing equal-spaced conversion on each dense chunk
independently, Algorithm 5 is run simultaneously on all chunks in order to determine,
for each polynomial, a single spacing parameter k that will be used for every
chunk.

Let f = f_1 x^{e_1} + f_2 x^{e_2} + ··· + f_t x^{e_t} in the optimal chunky representation for
multiplication by another polynomial g. We first compute the smallest bound
on the spacing parameter k for any of the chunks f_i, using Lemma 3.3. Starting
with this value, we execute the while loop of Algorithm 5 for each polynomial
f_i, stopping at the largest value of k such that the total size of all sets S on
Step 9 for all chunks f_i is at most log_2 t_f, where t_f is the total number of nonzero
terms in f.

The polynomial f can then be rewritten (recycling the variables f_i and e_i)
as

f = (f_1 ∘ x^k) · x^{e_1} + (f_2 ∘ x^k) · x^{e_2} + ··· + (f_t ∘ x^k) · x^{e_t} + f_S,

where f_S is in the sparse representation and has O(log t_f) nonzero terms.

Let k* be the value returned from Algorithm 5 on input of the entire
polynomial f. Using k* instead of k, f could still be written as above with f_S
having at most log_2 t_f terms. Therefore the value of k computed in this way is
always greater than or equal to k* if the initial bounds are correct. This will
be the case except when every chunk f_i has few nonzero terms (and therefore
t is close to t_f). However, this reduces to the problem of converting a sparse
polynomial to the equal-spaced representation, which seems to be intractable,
as discussed above. So our cost analysis will be predicated on the assumption
that the computed value of k is never smaller than k*.

We perform the same equal-spaced conversion for g, and then use
Algorithm 1 to compute the product f · g, with the difference that each product
f_i · g_j is computed by Algorithm 4 rather than standard dense multiplication.
As with equal-spaced multiplication, the products involving f_S or g_S are
performed using standard sparse multiplication.

Theorem 4.1. The algorithm described above to multiply polynomials with
equal-spaced chunks never uses more ring operations than either chunky or
equal-spaced multiplication, provided that the computed "spacing parameters" k and ℓ
are not smaller than the values returned from Algorithm 5.

Proof. Let n, m be the degrees of f, g respectively and write t_f, t_g for the number
of nonzero terms in f, g respectively. The sparse multiplications involving f_S
and g_S use a total of t_g log t_f + t_f log t_g + (log t_f)(log t_g) ring operations. Both
the chunky or equal-spaced multiplication algorithms always require
O(t_g δ(t_f) + t_f δ(t_g)) ring operations in the best case, and since δ(n) ∈ Ω(log n),
the cost of these sparse multiplications is never more than the cost of the standard
chunky or equal-spaced method.

The remaining computation is that to compute each product f_i · g_j using
equal-spaced multiplication. Write k and ℓ for the powers of x in the right
composition factors of f and g respectively. Theorem 3.1 tells us that the cost
of computing each of these products by equal-spaced multiplication is never
more than computing them by standard dense multiplication, since k and ℓ are
both at least 1. Therefore the combined approach is never more costly than just
performing chunky multiplication.

To compare with the cost of equal-spaced multiplication, assume that k and
ℓ are the actual values returned by Algorithm 5 on input f and g. This is the
worst case, since we have assumed that k and ℓ are never smaller than the values
from Algorithm 5.

Now consider the cost of multiplication by a single equal-spaced chunk of
g. This is the same as assuming g consists of only one equal-spaced chunk.
Write d_i = deg f_i for each equal-spaced chunk of f, and r, s for the gcd and
lcm of k and ℓ, respectively. If m > n, then of course m is larger than each
d_i, so multiplication using the combined method will use O((m/r) Σ δ(d_i/s))
ring operations, compared to O((m/r) δ(n/s)) for the standard equal-spaced
algorithm, by Theorem 3.1.

Now recall the cost equation (2.4) used for Algorithm 3:

c_f(b) · c_g(b) · b · δ(b),

where b is the size of all dense chunks in f and g. By definition, c_f(n) = 1,
and c_g(n) ≤ m/n, so we know that c_f(n) c_g(n) n δ(n) ≤ m δ(n). Because the
chunk sizes d_i were originally chosen by Algorithm 3, we must therefore have
m Σ_{i=1}^{t} δ(d_i) ≤ m δ(n). The restriction that the δ function grows more slowly
than linear then implies that (m/r) Σ δ(d_i/s) ∈ O((m/r) δ(n/s)), and so the
standard equal-spaced algorithm is never more efficient in this case.

When m ≤ n, the number of ring operations to compute the product using
the combined method, again by Theorem 3.1, is

O(δ(m/s) Σ (d_i/r) + (m/r) Σ δ(d_i/s)), (4.1)

compared with O((n/r) δ(m/s)) for the standard equal-spaced algorithm.
Because we always have Σ_{i=1}^{t} d_i ≤ n, the first term of (4.1) is O((n/r) δ(m/s)).
Using again the inequality m Σ_{i=1}^{t} δ(d_i) ≤ m δ(n), along with the fact that
m δ(n) ∈ O(n δ(m)) because m ≤ n, we see that the second term of (4.1) is also
O((n/r) δ(m/s)). Therefore the cost of the combined method is never more than
the cost of equal-spaced multiplication alone. □



5 Conclusions and Future Work 

Two methods for adaptive polynomial multiplication have been given where we 
can compute optimal representations (under some set of restrictions) in linear 
time in the size of the input. Combining these two ideas into one algorithm 
inherently captures both measures of difficulty, and will in fact have significantly 
better performance than either the chunky or equal-spaced algorithm in many 
cases. 

However, converting a sparse polynomial to the equal-spaced representation 
in linear time is still out of reach, and this problem is the source of the restriction 
of Theorem 4.1. Some justification for the impossibility of such a conversion 
algorithm was given, due to the fact that the exponents could be long integers. 
However, we still do not have an algorithm for sparse polynomial to equal-spaced 
conversion under the (probably reasonable) restriction that all exponents be 
word-sized integers. A linear-time algorithm for this problem would be useful 
and would make our adaptive approach more complete, though slightly more 
restricted in scope. 

Some early results from a trial implementation indicate that the algorithms 
we present are quite good at computing efficient adaptive representations, even 
in the presence of "noise" in the input polynomials, and although the conversion 
does sometimes have a measurable cost, it is almost always significantly less 
than the cost of the actual multiplication. Some of these results were reported 
in [Roche, 2008], giving evidence that our theoretical results hold in practice, 
but more work on an efficient implementation is still needed. 

Yet another area for further development is multivariate polynomials. We
have mentioned the usefulness of Kronecker substitution, but developing an
adaptive algorithm to choose the optimal variable ordering would give significant
improvements.

Finally, even though we have proven that our algorithms produce optimal
adaptive representations, it is always under some restriction of the way that
choice is made (for example, requiring that an "optimal chunk size" k be chosen
first, and that optimal conversions then be computed given k). These results would be
significantly strengthened by proving lower bounds over all available adaptive
representations of a certain type, but such results have thus far been elusive.